1 Introduction

The VSIPL (Vector, Signal, and Image Processing Library) specification
defines a portable C-language interface for use in linear algebra and
signal-processing applications. The VSIPL standard has been implemented
by a variety of vendors. VSIPL's portable interface lets developers
write code once and reuse it in multiple environments.

At HPEC 2002, we presented an overview of VSIPL++, a C++ specification
designed to perform the same types of computations as VSIPL. The primary
goals for VSIPL++ are improved serial performance relative to VSIPL, support
for multi-processor systems, extensibility, and simpler syntax.

The serial VSIPL++ specification is virtually complete. By HPEC 2003,
we expect to have a successful implementation of the specification. We
anticipate that the performance of the VSIPL++ reference implementation
will be superior to that of VSIPL for some applications. By HPEC 2003, the
reference implementation of VSIPL++ will contain preliminary support for
parallel systems.

Our presentation will compare the performance of VSIPL++ with that of VSIPL
and demonstrate the VSIPL++ support for parallel computation. We will also
discuss VSIPL++ implementation strategies, including the use of an existing
VSIPL implementation, a native C++ implementation using expression
templates, and a hybrid approach that allows an implementor to incrementally
reimplement portions of VSIPL++ to achieve higher performance.


2 Performance Comparisons

VSIPL++ can be implemented on either uni-processor or multi-processor
hardware. As a first step, we are implementing VSIPL++ using an existing C
VSIPL library. While this implementation is straightforward, its performance
is limited by that of the underlying VSIPL implementation. We also have a
preliminary implementation of some portions of VSIPL++ using a
high-performance expression-template technique. By HPEC 2003, we plan to
have a partial parallel implementation of VSIPL++.
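
The expression-template technique mentioned above can be illustrated with a
minimal sketch. This is not the VSIPL++ reference implementation; the names
Expr, Sum, and Vec are hypothetical and chosen only for illustration. The idea
is that an expression such as a + b + c builds a lightweight object describing
the computation, and the assignment then evaluates it in a single loop without
allocating temporary vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CRTP base that marks a type as an expression (hypothetical name).
template <typename E>
struct Expr {
  const E& self() const { return static_cast<const E&>(*this); }
};

// Node describing the elementwise sum of two expressions; nothing is
// computed until an element is requested through operator[].
template <typename L, typename R>
class Sum : public Expr<Sum<L, R> > {
 public:
  Sum(const L& l, const R& r) : l_(l), r_(r) { assert(l.size() == r.size()); }
  double operator[](std::size_t i) const { return l_[i] + r_[i]; }
  std::size_t size() const { return l_.size(); }

 private:
  const L& l_;  // operands outlive the full expression that uses them
  const R& r_;
};

// A simple dense vector (illustrative stand-in, not a VSIPL++ type).
class Vec : public Expr<Vec> {
 public:
  Vec(std::size_t n, double value) : data_(n, value) {}
  double operator[](std::size_t i) const { return data_[i]; }
  double& operator[](std::size_t i) { return data_[i]; }
  std::size_t size() const { return data_.size(); }

  // Assigning an expression evaluates it in one pass over the elements.
  template <typename E>
  Vec& operator=(const Expr<E>& e) {
    const E& expr = e.self();
    assert(expr.size() == size());
    for (std::size_t i = 0; i < size(); ++i) data_[i] = expr[i];
    return *this;
  }

 private:
  std::vector<double> data_;
};

// operator+ only records its operands; it performs no arithmetic itself.
template <typename L, typename R>
Sum<L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
  return Sum<L, R>(l.self(), r.self());
}
```

With these definitions, a statement such as r = a + b + c compiles to one loop
over the elements. Whether this is faster than calling a library routine per
operation depends on the compiler and the memory system.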

We are using a simple FIR-filter and narrowband beamforming application
as a benchmark. These computations are fundamental to many
signal-processing applications. We plan to present performance comparisons
among VSIPL, VSIPL++ built atop VSIPL, VSIPL++ using expression templates,
and VSIPL++ using multiple processors.


3 Parallel Computation Model

VSIPL++ uses a Single Program Multiple Data (SPMD) model when performing
parallel computations. The VSIPL++ model divides rectangular arrays of
data (known as “blocks”) into sections using combinations of block and cyclic
data distributions. We will explain the VSIPL++ model and demonstrate
how a very simple distribution model can accommodate systems ranging from
small embedded systems to large systems with thousands of nodes. We will also
explain how a wide variety of distribution policies can be implemented atop
the simple distribution model provided by VSIPL++. Finally, we will explain
how the VSIPL++ parallelism model provides support for fault tolerance via
dynamic reallocation of processors.
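
To make the block and cyclic distributions concrete, the following sketch
computes which processor owns a given element of a one-dimensional array. It
assumes the conventional definitions of these distributions (contiguous chunks
for block, round-robin for cyclic); the VSIPL++ specification may define the
boundaries differently, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Block distribution: processor k owns one contiguous chunk of
// ceil(n / p) elements (the last chunk may be shorter).
std::size_t block_owner(std::size_t i, std::size_t n, std::size_t p) {
  assert(p > 0 && i < n);
  const std::size_t chunk = (n + p - 1) / p;
  return i / chunk;
}

// Cyclic distribution: elements are dealt to processors one at a time.
std::size_t cyclic_owner(std::size_t i, std::size_t p) {
  assert(p > 0);
  return i % p;
}

// Block-cyclic distribution: chunks of b elements are dealt round-robin.
std::size_t block_cyclic_owner(std::size_t i, std::size_t b, std::size_t p) {
  assert(b > 0 && p > 0);
  return (i / b) % p;
}
```

Under these assumptions, block-cyclic with b = 1 reduces to the cyclic
distribution, and with b = ceil(n / p) it reduces to the block distribution.
For a multi-dimensional block, one such rule would be chosen per dimension.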

Design Path

Specification Status

• Serial Specification
  - 216-page draft.
  - Under review by VSIPL Forum.
• Parallel Specification
  - 24-page preliminary draft.
  - Initial conceptual review complete.

Serial Performance

• Uses VSIPL reference implementation.
  - Not the fastest implementation...
  - ...but the relative performance is important.
• Environment:
  - 2 GHz Pentium-M
  - 512 KB cache, 512 MB RAM
  - GNU/Linux, G++ 3.4

Matrix/Vector

V += mv

Matrix/Matrix

result += tan(sin(m) + cos(m))

Checked Vector Access

Performance Conclusions

• VSIPL++ has approximately zero overhead relative to VSIPL.
  - Memory effects actually enable VSIPL++ to outperform VSIPL.
  - Expression-template techniques may also improve performance.
• Exceptions are expensive.
  - We are not sure if this overhead can be eliminated.
• Reference implementation will be directly useful.
  - Vendor-optimized versions will probably be better.
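
The difference between checked and unchecked element access can be sketched as
follows. The source does not show how VSIPL++ performs its checks, so this is
only an illustration using std::vector, with hypothetical function names. The
checked version adds one comparison per access and reports a failure by
throwing an exception; the sketch makes no claim about the measured cost.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Unchecked in release builds: the assert is compiled out when NDEBUG is
// defined, and an out-of-range index is then undefined behavior.
inline double get_unchecked(const std::vector<double>& v, std::size_t i) {
  assert(i < v.size());
  return v[i];
}

// Checked access: always tests the index and throws on failure.
inline double get_checked(const std::vector<double>& v, std::size_t i) {
  if (i >= v.size()) throw std::out_of_range("index out of range");
  return v[i];
}
```

A caller that cannot rule out bad indices would use get_checked and handle
std::out_of_range; inner loops with indices known to be valid would use
get_unchecked.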

Parallelism

• Target systems:
  - Support 1 to 64K+ processors.
  - Support MPI, POSIX threads.
• Conceptual model:
  - Single-program multiple-data model.
  - Owner computes.
  - Parallelism requires changing only declarations, not expressions.

Parallel VSIPL++ Model

[Figure: view0, view1, view2, and view3, shown together with the user program and the hardware.]

CodeSourcery
www.codesourcery.com


Using Parallelism

• Declaration:

    Vector<double,
           Dense<1, double,
                 Map<Block> > >
      V(17, 1.0, Block(4));

• Meaning:
  - 17: Vector length.
  - 1.0: Initial value.
  - Block(4): Block distribution over 4 processors.
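
As a worked example of the declaration above, the sketch below computes which
elements each of the 4 processors would own for a vector of length 17, and
initializes them in owner-computes style. It assumes a convention in which each
processor receives ceil(17 / 4) = 5 contiguous elements and the last processor
receives the remainder, giving sizes 5, 5, 5, and 2; the specification may
balance the remainder differently. The names local_range and fill_owned are
hypothetical, and a single shared std::vector stands in for storage that would
really be spread across processors.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Half-open global index range [first, last) owned by `rank` under a
// block distribution of n elements over p processors (assumed convention).
std::pair<std::size_t, std::size_t>
local_range(std::size_t rank, std::size_t n, std::size_t p) {
  assert(p > 0 && rank < p);
  const std::size_t chunk = (n + p - 1) / p;  // ceil(n / p)
  const std::size_t first = std::min(rank * chunk, n);
  const std::size_t last = std::min(first + chunk, n);
  return std::make_pair(first, last);
}

// Owner computes: a rank writes only the elements it owns.
void fill_owned(std::vector<double>& data, std::size_t rank, std::size_t p,
                double value) {
  const std::pair<std::size_t, std::size_t> r =
      local_range(rank, data.size(), p);
  for (std::size_t i = r.first; i < r.second; ++i) data[i] = value;
}
```

Calling fill_owned once per rank with the value 1.0 initializes every element
exactly once, which mirrors what the declaration expresses in a single line.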

FY04 Objectives

• Specification:
  - Finalize serial and parallel specifications.
  - Get approval from VSIPL Forum.
• Implementation:
  - Finish serial implementation.
  - Draft parallel implementation.
• Measurement:
  - Performance analysis.

Contact Information

• Mark Mitchell
  mark@codesourcery.com
• Jeffrey Oldham
  oldham@codesourcery.com
• Nathan Sidwell
  nathan@codesourcery.com